ABSTRACT

A predictive adaptive testing (PAT) strategy was 
developed based on statistical predictive analysis, and its 
feasibility was studied by comparing PAT performance to that of the
Flexilevel, Bayesian modal, and expected a posteriori (EAP)
strategies in a simulated environment. The proposed adaptive test is
based on the idea of using item difficulty and past information 
(observed data) about an examinee to acquire the probability of 
answering further items correctly. Development of the PAT model is 
described with reference to: (1) initial items; (2) scoring method;
(3) selection of subsequent items to be administered; and (4)
terminating criteria. The model was compared to the Flexilevel, 
Bayesian modal, and EAP strategies in a Monte Carlo simulation study 
in which the ability levels of 999 examinees were generated using a 
71-item test. The strategies performed similarly at the low ability 
level. At the medium level, the Bayesian modal and EAP strategies 
were the most efficient. At the high level, the Bayesian modal 
strategy required fewer items than did the PAT and the EAP 
strategies. The three strategies produced similar results in terms of 
error variance and ability estimates. The PAT is potentially useful, 
particularly in small classroom testing. There are 12 tables of study 
data, 2 figures, and a 14-item list of references. (SLD) 
ESTIMATION OF ABILITY LEVEL BY USING ONLY OBSERVABLE QUANTITIES

IN ADAPTIVE TESTING 

The objectives of this study were (a) to develop the
predictive adaptive testing (PAT) strategy, which is based on
statistical predictive analysis; and (b) to investigate its
feasibility by comparing the performance of PAT to that of the
Flexilevel (Lord, 1971, 1980), Bayesian modal (Assessment Systems
Corporation, 1990) and expected a posteriori (EAP) (Bock & Aitkin,
1981) strategies in a simulated environment.

MODEL 

Predictive Statistical Analysis in Educational Testing

Much of statistical analysis is concerned with making
inferences about the distributions of unknown parameters. In
educational testing, the parameter θ usually represents the ability
or trait of an examinee to be measured, and an educational test is
a tool that quantifies his/her ability level in some way to obtain
a numerical score. This educational test could be a conventional
fixed-length paper-and-pencil test or an adaptive test.

The proposed adaptive test is based on the idea of using item
difficulty p and past information (observed data) x about an
examinee (in this case, the number of correct scores
during the testing up to a certain point) to acquire his/her
probability of answering future item(s) correctly.

The statistical predictive analysis is composed of two
experiments: informative experiment e and future experiment f.
Each informative experiment e_i is an experiment that was performed
in the past, and its typical outcome is denoted by x_i, where

\[
x_i = \begin{cases} 1 & \text{if the response is correct,} \\ 0 & \text{otherwise,} \end{cases}
\]

is distributed as a Bernoulli variable with parameter θ,

\[
f(x_i; \theta) = \theta^{x_i} (1-\theta)^{1-x_i}.
\]

The informative experiment e involves responses to items that
have already been administered. The future experiment f
involves the item(s) that will be administered to the examinee
following the items already administered during the informative
experiment e. Likewise, the outcome y_j of the future experiment f_j
is a dichotomously scored item,

\[
y_j = \begin{cases} 1 & \text{if the response is correct,} \\ 0 & \text{otherwise.} \end{cases}
\]

Then, the number of correct scores in the future, y = Σ y_j, is
distributed binomially with parameter θ, f(y; θ), if the items are
independent and the probability of y_j = 1 is constant across the items.

The informative experiment e conveys information to the future
experiment f about the performance of an examinee up to a
particular point through the ability parameter θ, which is assumed to
be fixed (Aitchison & Dunsmore, 1975, p. 19). This is the only link
between these two experiments. A second assumption is
that, for a given examinee, the responses to the previous items
do not affect the responses to the future item(s). This assumption
is similar to the local independence assumption in item response
theory (IRT). In a simulation study, this assumption can easily be met.
However, in real testing situations, an examinee's response to the
previous item(s) may affect the response to the future item(s).



DEVELOPMENT OF THE MODEL

The development of the PAT model can best be described according to
the components of adaptive testing. These components are: (a)
initial (entry-level) item, (b) scoring method, (c) selection of
the subsequent items to be administered, and (d) terminating
criterion.

Initial (Entry-Level) Item

In general, the prior distribution contains some information
about the parameter θ. An investigator intends to generate more
accurate inferences about the parameter θ by using the prior
information. Since generation of a posterior distribution is
simplified if the prior and likelihood densities belong to the same
conjugate family, the prior distribution of ability in predictive
adaptive testing is assumed to be a beta distribution with a
location parameter g>0 and a scale parameter h>0:

\[
f(\theta) = \frac{\Gamma(g+h)}{\Gamma(g)\,\Gamma(h)}\, \theta^{g-1} (1-\theta)^{h-1}, \qquad 0 < \theta < 1, \tag{1}
\]

where the ability parameter θ lies between 0 and 1.

The selection of the entry-level item is closely related to
the prior distribution of ability. Since there are no informative
data at the beginning of the testing, the total number of
correct answers x and the total number of items already administered n
are both 0. Therefore, the probability of answering
the initial item correctly given item difficulty p and no observed
data, f(y=1; p, x=0), equals the mean value of the beta (prior)
distribution, g/(g+h), where g and h are the location and scale
parameters, respectively. Thus, the initial item selected is the one
whose item difficulty level is closest to the mean of the prior
distribution.
Scoring Method 

The likelihood function L(θ) of the item responses (x_1, ..., x_n) is
the product of Bernoulli densities,
f(x_i; θ) = θ^{x_i} (1-θ)^{1-x_i}. Thus,

\[
L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{x} (1-\theta)^{n-x}, \tag{2}
\]

where 0 < θ < 1, x = Σ x_i, and x = 0, 1, ..., n.

Then, the posterior distribution is a beta distribution with
density

\[
f(\theta; x) = (\text{constant}) \cdot \theta^{x+g-1} (1-\theta)^{n+h-x-1}, \tag{3}
\]

where constant = Γ(n+g+h)/(Γ(x+g)Γ(n+h-x)). The mean of this
distribution is (x+g)/(n+g+h) and the variance is
(x+g)(n+h-x)/((n+g+h)^2 (n+g+h+1)). As mentioned before, the probability
assessment about the unknown parameter θ is not the final objective
of the predictive analysis. The main purpose is to assess a
probability about the future outcome y given informative data x
without the unknown parameter θ. Thus, the predictive density
function can be expressed as

\[
f(y; x) = \int_{0}^{1} f(y; \theta)\, f(\theta; x)\, d\theta, \tag{4}
\]

where f(y; θ), which describes the future experiment, is distributed
binomially with parameter θ and sample size m (the number of items to
be administered in the future),

\[
f(y; \theta) = \binom{m}{y} \theta^{y} (1-\theta)^{m-y}, \tag{5}
\]

where y = 0, 1, ..., m. Then, the predictive distribution,
beta-binomial, for y given f(θ) and x, can be written as

\[
f(y; x) = \binom{m}{y} \frac{\Gamma(u+v)\,\Gamma(y+u)\,\Gamma(m+v-y)}{\Gamma(u)\,\Gamma(v)\,\Gamma(m+u+v)}, \tag{6}
\]

where y = 0, 1, ..., m, u = x+g, and v = n+h-x (Ferguson, 1967). The mean of
this distribution is mu/(u+v) and the variance is
muv(m+u+v)/((u+v)^2 (u+v+1)).
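The beta-binomial predictive distribution can be evaluated numerically. The following Python sketch is illustrative only and is not the study's Fortran IV program; the function names are hypothetical, and the default prior parameters g = h = 2 are taken from the Procedure section. Log-gamma functions are used to avoid overflow for larger n and m.

```python
from math import comb, exp, lgamma


def beta_binomial_pmf(y, m, x, n, g=2.0, h=2.0):
    """Predictive probability f(y; x) of y correct answers on m future items,
    given x correct answers on n administered items and a Beta(g, h) prior."""
    u = x + g          # updated first beta parameter
    v = n + h - x      # updated second beta parameter
    log_ratio = (lgamma(u + v) + lgamma(y + u) + lgamma(m + v - y)
                 - lgamma(u) - lgamma(v) - lgamma(m + u + v))
    return comb(m, y) * exp(log_ratio)


def predictive_prob_correct(x, n, g=2.0, h=2.0):
    """f(y=1; x) for a single future item (m = 1); equals u / (u + v)."""
    return beta_binomial_pmf(1, 1, x, n, g, h)
```

For m = 1 the expression reduces to (x+g)/(n+g+h), the posterior mean of θ, so before any item is administered (x = n = 0) the predictive probability equals the prior mean g/(g+h).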

Figure 1

The Basic Steps Leading to the Predictive Distribution¹

[Figure 1 diagram: L(θ) and f(θ) lead, through arrows 1 and 2, to
f(θ;x); f(θ;x) and the distribution of the future outcome y lead,
through arrows 3 and 4, to f(y;x).]

Figure 1 summarizes the basic steps leading to the predictive
distribution. The arrows 1 and 2 converge to f(θ;x), which is a
result of Bayes' theorem. From that point, the posterior
distribution together with the distribution of the future outcome y,
arrows 3 and 4, are combined by using the definition of the predictive
distribution in (4).

¹ The figure presented here is provided by Aitchison and
Dunsmore, 1975.

The predictive distribution f(y;x) is the best approximation to
f(y;θ) (Aitchison, 1975), which describes the examinee's future
performance. To find the ability estimate of an examinee, i.e.,
the probability of answering the next item correctly given item
difficulty p and number of correct scores x, f(y=1; p, x), the
proportionality of f(y=1; p, x) to f(p; y=1, x) f(y=1; x) is used
(Hacking, 1965). Here f(y=1; x) is the predictive probability and
f(p; y=1, x) is the posterior probability of item difficulty given
the past (observed data) and future information of an examinee. The
item difficulty p is calculated as the proportion of the total group
responding to an item incorrectly. To obtain the posterior
distribution of item difficulty p, a prior distribution for item
difficulty p is defined as a beta distribution with scale parameter
l>0 and location parameter k>0. The resulting posterior
distribution is again a beta distribution with parameters k+x and
l-x, where x = Σ x_i. Therefore, after the test is terminated,
f(y=1; p, x), the probability of answering the next item
correctly (y=1) given item difficulty p and the number of correct
responses x to the items already administered, is regarded as the
ability estimate of an examinee. Thus, the probability f(y=1; p, x)
combines the information from item difficulty, observed data and
the examinee's ability level.
Selection of Subsequent Items to be Administered

To find the most appropriate item to administer to an
examinee, the following criterion is considered:

\[
\min \left| f(y=1; x) - f(y=1; p, x) \right|, \tag{7}
\]

where f(y=1; x) is the predictive probability of answering the next
item correctly given the examinee's number of correct scores on the
items already administered. The item difficulty parameter p is
calculated as the proportion of the total group responding to an item
incorrectly. The above criterion is constructed by considering the
following relations: (a) for an adequately large item pool, an
almost perfect positive correlation between f(y=1; x) and f(y=1; p, x),
the probability of answering the next item correctly given
item difficulty and number of correct scores; and (b) a high
negative correlation between f(y=1; p, x) and item difficulty p
(0 ≤ p < 1). According to the above criterion, the most appropriate
item to be administered is the one with item difficulty that is
closest to the examinee's predictive probability. In an adequately large
item pool, it can be shown that the values of f(y=1; x) and
f(y=1; p, x) are similar for an item selected according to
criterion (7) specified above. Therefore, they can both be used as
an ability estimate of an examinee. Thus, the most appropriate
item to be administered is the one whose item difficulty is closest
to the examinee's ability level.
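The selection rule can be sketched in Python as follows. This is a simplified illustration, not the study's program: it follows the verbal rule above (pick the item whose difficulty is closest to the predictive probability) rather than evaluating f(y=1; p, x) in criterion (7) directly, and the function names, the list-based item pool, and the defaults g = h = 2 are assumptions made for the example.

```python
def predictive_prob_correct(x, n, g=2.0, h=2.0):
    """f(y=1; x): predictive probability of a correct answer on the next item."""
    return (x + g) / (n + g + h)


def select_next_item(difficulties, administered, x, n, g=2.0, h=2.0):
    """Return the index of the unused item whose difficulty value is closest
    to the current predictive probability (hypothetical helper)."""
    target = predictive_prob_correct(x, n, g, h)
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    if not candidates:
        raise ValueError("item pool is exhausted")
    return min(candidates, key=lambda i: abs(difficulties[i] - target))
```

With x = n = 0 the target is the prior mean g/(g+h), so the same function also covers the choice of the entry-level item.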

Termination criteria 

There are two widely used termination criteria in the literature:
(a) testing continues until a prespecified number of items is
administered, and (b) testing continues until a prespecified value of
an information function or standard error of estimate is reached.
In predictive adaptive testing, a combination of these two widely
used termination criteria is employed. That is, testing
continues until either a prespecified number of items has been
administered or a prespecified value of the standard error of estimate
is reached.

The standard error of estimate obtained from the posterior
distribution of ability (3) is considered as a termination
criterion. The beta distribution in (3) is derived as the
posterior distribution of ability in the process of deriving the
predictive (beta-binomial) distribution,

\[
f(\theta; x) = (\text{constant}) \cdot \theta^{x+g-1} (1-\theta)^{n+h-x-1}, \tag{8}
\]

where constant = Γ(n+g+h)/(Γ(x+g)Γ(n+h-x)). The mean of this
distribution is (x+g)/(n+g+h) and the variance is
(x+g)(n+h-x)/((n+g+h)^2 (n+g+h+1)). The parameters g and h stand for
the location and scale parameters of the prior distribution of
ability, and x denotes the number of correct scores out of n items
already administered. Testing continues until the square root of the
variance of the above beta distribution reaches the prespecified
value. As a result, after the testing is terminated, predictive
adaptive testing provides a final predictive probability,
f(y=1; p, x) or f(y=1; x), either of which can be used as an
ability estimate.
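A minimal Python sketch of the combined stopping rule is given below. The function names and the threshold arguments se_target and max_items are hypothetical; the appropriate numerical threshold on the beta scale is not restated here, so the caller must supply it.

```python
from math import sqrt


def posterior_sd(x, n, g=2.0, h=2.0):
    """Square root of the variance of the beta posterior in (8)."""
    a = x + g
    b = n + h - x
    total = a + b          # equals n + g + h
    return sqrt(a * b / (total ** 2 * (total + 1)))


def should_stop(x, n, se_target, max_items, g=2.0, h=2.0):
    """Stop when the item limit is reached or the posterior SD
    has fallen to the prespecified value, whichever happens first."""
    return n >= max_items or posterior_sd(x, n, g, h) <= se_target
```

Because the posterior variance shrinks roughly in proportion to 1/(n+g+h), the second condition is eventually met for any positive threshold, while the item limit bounds the test length.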
METHOD AND PROCEDURE

The performance of predictive adaptive testing was compared to 
those of the flexilevel, the Bayesian modal and EAP strategies. In 
order to show the feasibility of predictive adaptive testing, data 
were generated by the Monte-Carlo simulation technique. 
Generation of Population 

In this Monte-Carlo simulation study, each examinee was
identified by a numerical value reflecting his/her ability level, θ.
Ability levels of a total of 999 examinees were randomly generated
from a standard normal distribution in the interval of -3.0 to +3.0.

The seventy-one-item test was generated by assuming that the
discrimination parameter a was distributed uniformly in the
interval of 0.19 to 1.69 (Hambleton & Traub, 1971). The difficulty
level b was distributed normally with mean 0 and variance 1 in the
interval of -3.0 to +3.0. Finally, the guessing parameter c was
assumed to be uniformly distributed in the interval of 0 to 0.20.
In order to simulate the responses of 999 examinees to the
seventy-one-item test, the subprograms of the IMSL (1984, version 9.2)
library on the PITT VAX/VMS system were used.

The dichotomous (0, 1) score of any examinee on any item was
a probabilistic function of the examinee's ability level θ, the item
difficulty b, and the parameters a and c. The probability P_i(θ_j)
of a correct response under the 3-parameter logistic model item
characteristic curve was calculated according to the following
formula:

\[
P_i(\theta_j) = c_i + \frac{1 - c_i}{1 + \exp[-D a_i (\theta_j - b_i)]},
\]

where i and j denote item and examinee, respectively, and D is the
scaling constant of the logistic model. In order to
simulate a dichotomous item response, each probability value P_i(θ_j)
was compared with a random number r_ij which was generated from a
uniform distribution in the interval of 0 to 1. The response was
assumed correct and a score of 1 was assigned if the probability
value was equal to or greater than the random number r_ij; otherwise a
score of 0 was assigned.
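The response-generation step can be sketched in Python as follows. This is an illustration rather than the IMSL-based procedure used in the study; the function names are hypothetical, and the scaling constant d = 1.7 is the conventional value and is an assumption here.

```python
import math
import random


def p_correct_3pl(theta, a, b, c, d=1.7):
    """Probability of a correct response under the 3-parameter logistic model.
    d is the scaling constant (1.7 is assumed, not confirmed by the study)."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def simulate_response(theta, a, b, c, rng, d=1.7):
    """Score 1 if the model probability is at least a uniform(0, 1) draw, else 0."""
    r = rng.random()
    return 1 if p_correct_3pl(theta, a, b, c, d) >= r else 0
```

Passing a seeded random.Random instance as rng makes a simulated data set reproducible. When θ equals the item difficulty b, the probability is c + (1 - c)/2, regardless of a and d.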

The program ASCAL (Assessment Systems Corporation, 1990) was
used to estimate ability and item parameters based on the generated
item responses from 999 examinees. Chi-squared goodness-of-fit²
tests for the true and estimated values of ability and for the true
and estimated values of item parameters were carried out in order
to provide evidence of how well the data generation process
worked.

The calculated chi-squared values are presented in Table 1 for
the ability parameter θ and the item parameters a, b, and c. It was
concluded that the estimated values of the ability parameter θ and
the item parameters were not significantly different from their
generated values.



² χ² = Σ (O_i - E_i)² / E_i is distributed as chi-squared with k-1
degrees of freedom, where k is the number of categories and O_i and
E_i are the observed and expected values, respectively.
Table 1

Chi-Squared Goodness-of-Fit Test Results

Test                  df    Chi-squared test value
Ability, θ            89    19.787
Discrimination, a     70     6.456
Difficulty, b         70    42.249
Guessing, c           70     8.317

Note: The goodness-of-fit tests are non-significant at the
α = 0.05 level.

Table 2

Conventional Item Statistics for Raw Scores

Number of items        71         Minimum     5
Number of examinees    999        Maximum     71
Mean                   37.856     Median      38
Variance               167.929    Alpha       0.926
Std. Dev.              12.959     SEM         3.532
Skewness               0.036      Mean Bis    0.545
Kurtosis               -0.675
Mean p                 0.533
Mean item-total        0.405


The program ITEMAN (Assessment Systems Corporation, 1990) was
employed to calculate the conventional item statistics such as
proportion correct, biserial correlation, and point-biserial
correlation. Furthermore, the alpha reliability coefficient was
calculated as 0.926. The results in Table 2 suggested that the
71-item test adequately represented examinees in the medium ability
group.
Finally, the test for unidimensionality of the ability space,
which is assumed by IRT, was performed. According to the test 
proposed by Reckase (1979), inter-item correlation coefficients 
were calculated in order to find the eigenvalues. The test results 
showed that the first eigenvalue accounted for 37% of the total 
variance which was greater than the recommended 20% value. 
Therefore, the assumption of unidimensionality of the ability space 
appeared to be reasonable. 
Sample 

Examinees were grouped into three different ability levels 
based on their randomly generated true abilities. In order to 
assign each examinee to one of the three groups of low, medium, or 
high, examinees were ranked according to their generated true 
ability level. Then, the examinees were divided into nine
mutually exclusive sections in such a way that each section contained
an equal number of examinees, i.e., 111. From each section, ten
examinees were randomly selected. Thirty examinees from the top 
three ability sections were grouped into the high ability group. 
Similarly, the same number of examinees from the bottom three 
ability sections were classified as a low ability group. The 
remaining examinees formed the middle ability group. 
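The sampling scheme described above can be sketched in Python. The function name and its arguments are hypothetical, and the example does not reproduce the truncation of abilities to the interval -3.0 to +3.0.

```python
import random


def sample_ability_groups(thetas, rng, n_sections=9, per_section=10):
    """Rank examinees by true ability, split them into equal-sized sections,
    draw per_section examinees from each, and pool the sections into
    low, medium, and high groups. Returns lists of examinee indices."""
    ranked = sorted(range(len(thetas)), key=lambda j: thetas[j])
    size = len(ranked) // n_sections          # 111 when there are 999 examinees
    sections = [ranked[k * size:(k + 1) * size] for k in range(n_sections)]
    picks = [rng.sample(section, per_section) for section in sections]
    third = n_sections // 3
    return {
        "low": [j for sec in picks[:third] for j in sec],
        "medium": [j for sec in picks[third:2 * third] for j in sec],
        "high": [j for sec in picks[2 * third:] for j in sec],
    }
```

With 999 examinees, nine sections, and ten draws per section, each group contains thirty examinees, which matches the group sizes stated in the text.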
Procedure

The Bayesian modal, EAP and PAT strategies required the
specification of a prior distribution for the examinee's ability
level. A medium ability level was assumed
for all strategies requiring the specification of a prior
distribution. For the Bayesian modal and EAP strategies, the mean
and variance of the normal prior distribution were specified as 0 and 1,
respectively. Since the IRT-based adaptive testing strategies and PAT
were based on different distributional assumptions, the prior
distributions were not perfectly comparable. For PAT,
the prior was a beta distribution with the location and scale
parameters g=2 and h=2, respectively. Since this beta
distribution is symmetrical, its mean, mode and median values were
all equal to 0.5.

Two termination criteria were used in the present study. In
determining the ability estimate of an examinee and the final
standard error of estimate, thirty-six items were administered to
every examinee. This is the maximum number of items
required by the 71-item flexilevel test. Therefore, the comparison
of the ability estimates and the final error variance of the
ability estimates from different strategies was based on the same
number of items. In determining the number of items required to
reach the prespecified termination criterion for the Bayesian
modal, EAP and PAT strategies, the standard error of estimate that
was calculated from the expected test information was set to 0.30.

To simulate the adaptive testing for the predictive and 
flexilevel strategies, Fortran IV computer programs were prepared. 
Items were selected according to the adaptive testing strategies 
and the corresponding response (correct or incorrect) was entered 
by the program itself. For the Bayesian modal and EAP strategies,
MicroCAT was used to administer adaptive testing. When the program
selected the appropriate item to administer, that particular item
was displayed on the screen. The investigator then entered the
response, either correct or incorrect, based on the examinee's
simulated response.

In order to assess the accuracy of the performance of PAT,
correlation coefficients were calculated between the ability estimate
(final predictive probability) obtained from PAT and the generated
ability score. Furthermore, the correlations between the generated
ability score and the other ability estimates obtained from
flexilevel, the Bayesian modal and EAP were computed as well. The
test of equality for the above correlation coefficients was
carried out in order to examine the similarity between estimated
and true ability scores in terms of the order of scores.
Data Collection

The following data were collected for each strategy:

1. item identifier;

2. subject's response (0, 1);

3. flexilevel test score and ability estimate scores obtained
from the Bayesian modal, EAP and PAT strategies (the final predictive
probability was used as the ability estimate for PAT);

4. the final error variance of the ability estimate, i.e.,
standard error of the posterior distribution for the Bayesian modal,
EAP and PAT strategies; and

5. the number of items required to reach a prespecified
terminating criterion.
Data Analysis

The independent variables that were considered are as follows: 

1. adaptive testing strategy, and 

2. ability levels. 

The ability level (high, medium, low) was regarded as a 
between-subjects variable. On the other hand, the adaptive testing 
strategy (flexilevel, PAT, the Bayesian modal, the EAP) was 
considered to be a within-subjects variable. In this study, a
two-way mixed factorial design with repeated measures on one of the
factors was used.

The dependent variables that were considered are: 

1. The number of items required for each strategy to reach a
prespecified terminating criterion. This dependent variable was
the indicator of efficiency in adaptive testing;

2. The absolute value of the difference between generated
true ability and estimated ability scores obtained from flexilevel,
the Bayesian modal, EAP and PAT strategies. Since the ability
estimates obtained from the IRT-based adaptive testing strategies,
flexilevel and PAT could not be compared on the same metric, due to
the difference in distributional assumptions, the difference was
calculated between the standardized scores. Thus, the comparisons,
in some sense, were made possible. Furthermore, the absolute value
of the difference was taken in order to show the accuracy and the
similarity of the obtained scores; and

3. The absolute value of the difference between the error
variance of the final estimate obtained from the adaptive testing
strategy and the error variance when the complete test was considered.
For the reasons mentioned in the previous paragraph, all the
error variances of ability estimates were transformed to
standardized scores before taking the differences. This dependent
variable was an indicator of similarity between the error variances
obtained from the adaptive testing strategy and the error variance of the
complete test (true error variance).
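The standardized absolute difference used for the second and third dependent variables can be sketched in Python. The function names are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert raw scores to z scores (population SD is assumed)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("scores have zero variance")
    return [(s - mu) / sigma for s in scores]


def abs_standardized_difference(estimates, true_values):
    """|z(estimate) - z(true value)| for each examinee, so that scores on
    different metrics (e.g. probabilities and theta values) can be compared."""
    z_est = standardize(estimates)
    z_true = standardize(true_values)
    return [abs(e - t) for e, t in zip(z_est, z_true)]
```

Under this definition, estimates that are a positive linear function of the true values yield differences of zero, so the measure reflects disagreement in relative standing rather than in scale.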

The following hypotheses were tested: 

H01: There is no significant difference between means of
examinees for different adaptive testing strategies for each of the
dependent variables 1-3,

H02: There is no significant difference between means of
examinees for three different ability levels for each of the
dependent variables 1-3,

H03: There is no significant interaction effect of the
adaptive testing strategy and ability level for each of the
dependent variables 1-3.

Since the flexilevel test administers the same number of items
to each examinee, it was excluded from hypothesis testing when the
first dependent variable was considered. For the second dependent
variable, all four adaptive testing strategies were included.
However, for the third dependent variable, since the error variance
could not be calculated for the flexilevel test, it was excluded
from hypothesis testing.
RESULTS AND DISCUSSION

Number of Items Required to Reach the Prespecified
Termination Criterion

The strategies that were considered here were the Bayesian
modal, EAP, and PAT. The preliminary studies showed that the raw
scores of the number of items required did not meet the assumptions
for carrying out F-tests in a mixed factorial design (Kirk, 1982, p. 74),
i.e., the observations were not normally distributed and the variances
were not equal. Therefore, an angular transformation (Kirk, 1982,
p. 83) of the observed scores was performed. The cell and marginal
means corresponding to the adaptive testing strategies and ability
groups are summarized in Table 3. The results of the two-way mixed
factorial design in Table 4 revealed that the interaction effect
between adaptive testing strategy and ability group was
statistically significant at α = 0.01. In order to show the nature
of the interaction effect, Figure 2 was plotted from the cell
means provided in Table 3. The plot indicated that, at the low
ability level, the PAT strategy required more items to reach the
prespecified termination criterion than the Bayesian modal and EAP.
However, the pairwise mean differences calculated according to the
Scheffe post-hoc method, at the low ability level, were not
statistically significant (see Table 5).
Table 3

Cell and Marginal Means of the Number of Items Required
to Reach the Prespecified Termination Criterion

             Low      Medium    High     Marginal
Modal        17.43    10.87     12.80    13.70
EAP          19.97    12.40     14.90    15.76
PAT          17.47    19.63     19.73    18.94
Marginal     18.29    14.30     15.81
Table 4

Results of the Mixed Factorial Design on the Number of
Items Required to Reach the Prespecified Termination
Criterion in Terms of Angular Transformation

Sources          SS       df     MS       F-ratio    p-value
Mean              0.05      1     0.05     0.11      0.744
Ability (A)       3.08      2     1.54     3.42      0.037
Error (A)        39.17     87     0.45
Strategy (S)     31.26      2    15.63    42.38      0.000
S x A            11.17      4     2.79     7.57      0.000
Error (S)        64.18    174     0.37
If the starting point matched the actual ability level, i.e.,
the medium ability level, the Bayesian modal and EAP required
fewer items than the PAT strategy. As can be seen in Table
5, the pairwise mean differences between PAT and EAP and between PAT
and the Bayesian modal were statistically significant at the α = 0.01
level.






Table 5

F-Values of Scheffe Test for Adaptive Testing Strategy
and Ability Group on the Number of Items Required to
Reach the Prespecified Termination Criterion
in Terms of Angular Transformation

Ability Group    Strategy    EAP       PAT
Low              Modal       2.043     2.177
                 EAP                   0.134
Medium           Modal       1.521     7.505*
                 EAP                   5.983*
High             Modal       5.461*    6.194*
                 EAP                   0.732

* p<0.01

Figure 2

Interaction Effect Between Adaptive Testing Strategy
and Ability Group on the Number of Items Required
to Reach the Prespecified Termination Criterion
At the high ability level, the number of items required
increased for the Bayesian modal and EAP strategies. However, for the
PAT strategy, the number of items required was higher than for the
Bayesian modal and EAP. The post-hoc comparison for the pairwise mean
difference between the PAT and EAP strategies, at the high ability
level, was not significant at α = 0.01. The Bayesian modal strategy
required significantly fewer items than the PAT and EAP.

In summary, the results revealed that, at the low ability
level, the numbers of items required by the three adaptive testing
strategies were not significantly different. The Bayesian modal
and EAP strategies required significantly fewer items
than the PAT when the starting point matched the actual
ability level. At the high ability level, the Bayesian modal
strategy required significantly fewer items than the PAT
and EAP.

Absolute Value of the Difference Between
Standardized Ability Estimate and Generated Ability

The second dependent variable was the absolute value of the
difference between the standardized ability estimate obtained from the
adaptive testing strategies and the generated ability scores. The data
were analyzed by a two-way mixed factorial design. The first factor
was the adaptive testing strategy (flexilevel, the Bayesian modal,
EAP, and PAT). The second factor was the ability group, i.e., low,
medium, and high.

For the same reasons mentioned in the preceding section, a
transformation of the data was necessary to meet the assumptions of
normality and homogeneity of variances. The data were transformed
by using the square-root method (Kirk, 1982, p. 82). The cell and
marginal means are presented in Table 6. The results of the two-way
mixed factorial design are summarized in Table 7.

Table 6

Cell and Marginal Means of the Obtained Ability Estimate
and Generated True Ability in Terms of
Raw Scores

             Low      Medium    High     Marginal
True         -1.22    -0.05     0.94     -0.11
Flex          0.38     0.53     0.68      0.53
Modal        -1.15    -0.06     0.96     -0.08
EAP          -1.17    -0.06     0.93     -0.10
PAT           0.32     0.41     0.55      0.42


The test results showed that the interaction effect between
adaptive testing strategy and ability group in terms of the
square root of the absolute value of the difference between the
standardized ability estimate and the standardized generated ability
score was not significant at the α = 0.01 level. Therefore, the next
step in the data analysis was to test the main effects due to the
adaptive testing strategy and ability group.
Table 7

Results of the Mixed Factorial Design on the
Absolute Value of Difference Between Standardized
Ability Estimate and Generated Ability in Terms of
Square-Root Transformation

Source           SS       df     MS       F-ratio    p-value
Mean             70.11      1    70.01    1496.52    0.000
Ability (A)       0.15      2     0.08       1.65    0.198
Error (A)         4.07     87     0.05
Strategy (S)      0.62      3     0.21       6.06    0.001
S x A             0.29      6     0.05       1.42    0.208
Error (S)         8.93    261     0.03



Table 7 revealed that the main effect of adaptive testing
strategy was significant at α = 0.01. The post-hoc comparisons of
pairwise mean differences were calculated by using the Scheffe
method and are summarized in Table 8. According to the results
presented in Table 8, the pairwise mean differences between
adaptive testing strategies were all non-significant at the α = 0.01
level. The pairwise comparisons computed by the Scheffe method
were not able to detect any significant mean differences between
adaptive testing strategies. On the other hand, the main effect of
ability group was found to be non-significant at the α = 0.01 level.
Table 8

F-Values of Scheffe Test for the Main Effect of
Adaptive Testing Strategy on the Absolute Value of
Difference Between Standardized Ability Estimate and
Generated Ability Score in Terms of
Square-Root Transformation

             Modal    EAP      PAT
Flex         0.548    0.629    2.164
Modal                 0.082    2.711
EAP                            2.793



In summary, the main effects of adaptive testing strategy and 
ability group were additive. Although the main effect of adaptive 
testing strategy was significant, the post-hoc comparisons did not 
reveal any significant pairwise differences between the means of 
adaptive testing strategies. 

Absolute Value of Difference Between Standardized Error 
Variances of Ability Estimate and Complete Test 

Because the flexilevel test does not yield an error variance for 
the ability estimate, it was not included in the statistical 
analysis. The strategies considered here were the Bayesian modal, 
EAP, and PAT. 

Due to the procedural differences among adaptive testing 
strategies, the IRT-based adaptive testing strategies and PAT did 
not produce directly comparable error variances. All the error 
variances of abilities were therefore transformed to z scores 
before the absolute values of the differences were taken. The 
absolute value of the difference between the error variance 
obtained from the complete test and the error variance obtained 
from an adaptive test served as the measure of accuracy. 

For the same reasons mentioned in previous sections, a 
transformation of the data was necessary to meet the assumptions 
for normality and homogeneity of variances. The scores were 
transformed by using the logarithmic transformation. 
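
The accuracy measure described above (standardize, take the absolute 
difference, then transform) can be sketched as follows. This is a 
minimal illustration under stated assumptions, not the study's code: 
each set of error variances is standardized with its own mean and 
sample standard deviation, the natural logarithm is used, and the 
sample values are invented. The source does not specify how the 
standardization was pooled or which logarithm base was used. 

```python
import math
from statistics import mean, stdev


def z_scores(values):
    """Standardize a list of values with its own mean and sample SD."""
    center = mean(values)
    spread = stdev(values)
    return [(value - center) / spread for value in values]


def log_abs_z_difference(adaptive_variances, complete_variances):
    """Log of |z(adaptive) - z(complete)| for each examinee.

    Assumption: natural log; the value is undefined (ValueError) when the
    two standardized variances coincide exactly.
    """
    z_adaptive = z_scores(adaptive_variances)
    z_complete = z_scores(complete_variances)
    differences = [abs(a - c) for a, c in zip(z_adaptive, z_complete)]
    return [math.log(difference) for difference in differences]


# Invented error variances for four hypothetical examinees.
adaptive = [0.05, 0.07, 0.06, 0.09]
complete = [0.03, 0.04, 0.03, 0.05]
scores = log_abs_z_difference(adaptive, complete)
print([round(score, 2) for score in scores])
```

Smaller (more negative) values indicate closer agreement between the 
adaptive and complete-test error variances on the standardized scale. 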

The cell and marginal means are presented in Table 9, and the 
results of the mixed factorial design are summarized in Table 10. 
The interaction between adaptive testing strategy and ability 
group, in terms of the logarithmic transformation of the absolute 
value of the difference between standardized error variances, was 
not significant at the α = 0.01 level. The tests for the main 
effects of adaptive testing strategy and ability group showed that 
neither was significant at α = 0.01. According to these results, 
the mean error variances produced by the Bayesian modal, EAP, and 
PAT strategies were statistically similar. 
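
The F-ratios in Table 10 appear to follow the usual mixed 
(split-plot) layout, in which the between-subjects effect (ability 
group) is tested against Error (A) and the within-subjects effects 
(strategy and the S X A interaction) are tested against Error (S). 
As a hedged check on that reading, the sketch below recomputes the 
ratios from the sums of squares and degrees of freedom reported in 
Table 10. The function name is illustrative, and the choice of error 
terms is an assumption about the analysis. 

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F = (SS_effect / df_effect) / (SS_error / df_error)."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return ms_effect / ms_error


# Sums of squares and degrees of freedom as reported in Table 10.
ERROR_A = (106.94, 87)    # between-subjects error term
ERROR_S = (158.98, 174)   # within-subjects error term

f_ability = f_ratio(2.59, 2, *ERROR_A)      # between-subjects effect
f_strategy = f_ratio(2.25, 2, *ERROR_S)     # within-subjects effect
f_interaction = f_ratio(8.51, 4, *ERROR_S)  # S X A interaction

for label, value in [("Ability", f_ability),
                     ("Strategy", f_strategy),
                     ("S X A", f_interaction)]:
    print(f"{label}: F = {value:.2f}")
```

Under these assumptions the recomputed ratios agree with the 
reported F-ratios to rounding, which supports reading Error (A) and 
Error (S) as the respective denominators. 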

Table 9 

Cell and Marginal Means of the Error Variances 
Obtained from Complete and Adaptive Tests 
in Terms of Raw Scores 

            Low     Medium     High     Marginal 

True        .03       .04       .03        .03 
Modal       .07       .05       .05        .06 
EAP         .08       .05       .06        .06 
PAT         .05       .06       .06        .06 

Table 10 

Results of the Mixed Factorial Design on the 
Absolute Value of Difference Between Standardized 
Error Variances Obtained from Adaptive Testing 
Strategies and Complete Test in Terms of 
Logarithmic Transformation 

Source            SS       df      MS     F-ratio    p-value 

Mean            14.58       1    14.58     11.86      0.001 
Ability (A)      2.59       2     1.29      1.05      0.354 
Error (A)      106.94      87     1.23 

Strategy (S)     2.25       2     1.12      1.23      0.295 
S X A            8.51       4     2.13      2.33      0.058 
Error (S)      158.98     174     0.91 

Correlation Coefficients Between Ability Estimate and 
Generated Ability 

In the final section, the correlation coefficients between 
ability estimates obtained from adaptive testing strategies and 
generated ability scores were computed. The results are 
summarized in Table 11. 

Table 11 

Correlation Coefficients Between True Ability and 
Ability Estimates and Also Between Ability Estimates 

           True      Flex      Modal     EAP       PAT 

Flex       0.966     1.000 
Modal      0.971     0.934     1.000 
EAP        0.976     0.925     0.980     1.000 
PAT        0.933     0.916     0.889     0.876     1.000 

Note: "True" stands for generated true ability score. 

The results showed that all the correlation coefficients 
presented in Table 11 were statistically significant at the α = 0.001 
level. The correlation coefficients between generated true ability 
and ability estimates obtained from adaptive testing strategies 
were all above 0.93. This revealed that all the ability estimates 
obtained from adaptive testing strategies were highly correlated 
with the generated true ability scores. The EAP strategy had the 
highest correlation coefficient (0.976). 

The tests for equality of the above correlation coefficients 
(Glass & Stanley, 1970, p. 313), such as 
corr(True, Flex) = corr(True, Modal), are summarized in Table 12. 

Table 12 

Test for Equality of Correlation Coefficients Between 
True Score and Ability Estimates 

                  (True, Modal)    (True, EAP)    (True, PAT) 

(True, Flex)         -0.7348         -1.5481         2.7247* 
(True, Modal)                        -0.8051         2.5797 
(True, EAP)                                          3.6108* 

* p<0.01 

The results showed that, in terms of correlations with the true 
score, PAT differed significantly from the flexilevel and EAP 
strategies at the 0.01 level. None of the other correlation 
coefficients between adaptive test scores and true scores differed 
significantly from one another. 
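
A test of this kind compares two correlations that share one 
variable (here the true score). The sketch below uses one large-sample 
z statistic for that situation, a form often attributed to Olkin; 
that it is the exact formula on the cited page of Glass and Stanley 
is an assumption. The function name is illustrative, and n = 90 is 
also an assumption, inferred from the 87 error degrees of freedom 
for three ability groups. The correlations are the rounded values 
from Table 11. 

```python
import math


def dependent_corr_z(r_xy, r_xz, r_yz, n):
    """z statistic for H0: rho_xy = rho_xz when both correlations share x.

    Assumed form (often attributed to Olkin):
    z = sqrt(n) * (r_xy - r_xz) / sqrt(
            (1 - r_xy^2)^2 + (1 - r_xz^2)^2 - 2 * r_yz^3
            - (2 * r_yz - r_xy * r_xz) * (1 - r_xy^2 - r_xz^2 - r_yz^2))
    """
    variance_term = (
        (1 - r_xy ** 2) ** 2
        + (1 - r_xz ** 2) ** 2
        - 2 * r_yz ** 3
        - (2 * r_yz - r_xy * r_xz) * (1 - r_xy ** 2 - r_xz ** 2 - r_yz ** 2)
    )
    return math.sqrt(n) * (r_xy - r_xz) / math.sqrt(variance_term)


N_EXAMINEES = 90  # assumption inferred from the error degrees of freedom

# x = True, y = Flex, z = Modal (rounded correlations from Table 11)
z_flex_modal = dependent_corr_z(0.966, 0.971, 0.934, N_EXAMINEES)
# x = True, y = EAP, z = PAT
z_eap_pat = dependent_corr_z(0.976, 0.933, 0.876, N_EXAMINEES)
print(round(z_flex_modal, 2), round(z_eap_pat, 2))
```

Under these assumptions the two values come out close to the 
corresponding entries of Table 12 (-0.7348 and 3.6108). Other 
entries need not match as closely, since the denominator is a small 
difference of large terms and is sensitive to rounding of the 
correlations. 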

SUMMARY AND CONCLUSION 

A model using predictive statistical analysis was developed. 
The feasibility of the model was examined by comparing it with 
other adaptive testing strategies in a simulation study. The 
results of the data analysis can be summarized as follows: 

1. In terms of the number of items administered to reach the 
prespecified termination criterion, the three adaptive testing 
strategies performed similarly at the low ability level. At the 
medium ability level, the Bayesian modal and EAP strategies were 
the most efficient. At the high ability level, the Bayesian modal 
strategy required significantly fewer items than the PAT and EAP 
strategies. 

2. In terms of the absolute value of the difference between 
standardized ability estimate and generated ability score, all the 
strategies yielded statistically comparable estimates. 

3. In terms of the absolute value of the difference between 
standardized error variances, the adaptive testing strategies 
considered (the Bayesian modal, EAP, and PAT) produced comparable 
results. 

4. As a final analysis, the correlation coefficients were 
calculated between the ability estimates obtained from the adaptive 
testing strategies and the generated true ability scores. The 
results showed that all the correlation coefficients were 
comparable and highly significant. The tests for the equality of 
the correlation coefficients, mentioned above, revealed that the 
PAT and Bayesian modal strategies produced ability estimates that 
were statistically similar in their agreement with the true ability 
scores in terms of the order of scores. 

PAT was not quite as efficient as the Bayesian modal and EAP 
strategies at the middle ability level in terms of the number of 
items required. However, PAT produced similar results in terms of 
error variance. When ability estimates were considered, all the 
adaptive testing strategies produced comparable results. 

Based on the results of this study, it can be concluded that 
PAT has the potential to be useful in practice. Because IRT-based 
adaptive testing strategies require a larger sample size to 
calibrate item parameters and require certain assumptions to be 
met, implementing PAT in small classroom testing is more practical. 